Abstract

In order to assess the effect of a health care intervention, it is useful to look
at an ensemble of relevant studies. The Cochrane Collaboration’s admirable
goal is to provide systematic reviews of all relevant clinical studies, in order to
establish whether or not there is conclusive evidence about a specific intervention.
This is done mainly by conducting a meta-analysis: a statistical synthesis
of results from a series of systematically collected studies. Health practitioners
often interpret a significant meta-analysis summary effect as a statement
that the treatment effect is consistent across a series of studies. However, the
meta-analysis significance may be driven by an effect in only one of the studies.
Indeed, in an analysis of two domains of Cochrane reviews we show that in a
non-negligible fraction of reviews, the removal of a single study from the
meta-analysis of primary endpoints makes the conclusion non-significant. Therefore,
reporting the evidence towards replicability of the effect across studies in addition
to the significant meta-analysis summary effect will provide credibility to
the interpretation that the effect was replicated across studies. We suggest an
objective, easily computed quantity, which we term the r-value, that quantifies the
extent of this reliance on single studies. We suggest adding the r-values to the
main results and to the forest plots of systematic reviews.

1 Introduction


In systematic reviews, several studies that examine the same questions are analyzed 
together. Viewing all the information is extremely valuable for practitioners in the 
health sciences. A notable example is the Cochrane systematic reviews on the effects 
of healthcare interventions. The process of preparing and maintaining Cochrane 
systematic reviews is described in detail in their manual [Higgins et al., 2011]. The
reviews attempt to assemble all the evidence that is relevant to a specific healthcare
intervention.


Deriving conclusions about the overall health benefits or harms from an ensemble 
of studies can be difficult, since the studies are never exactly the same and there
is a danger that these differences affect the inference. For example, factors that are
particular to the study, such as the specific cohorts in the study that are from specific 
populations exposed to specific environments, the specific experimental protocol used 
in the study, the specific care givers in the study, etc., may have an impact on the 
treatment effect. 

A desired property of a systematic review is that the effect has been observed in 
more than one study, i.e., the overall conclusion is not entirely driven by a single 
study. If a significant meta-analysis finding becomes non-significant by leaving out 
one of the studies, this is worrisome for two reasons: first, the finding may be too 
particular to the single study (e.g., the specific age group in the study); second, there 
is greater danger that the significant meta-analysis finding is due to bias in the single 
study (e.g., due to improper randomization or blinding). We view this problem as a
replicability problem: the conclusion about the significance of the effect is completely 
driven by a single study, and thus we cannot rule out the possibility that the effect 
is particular to the single study, i.e., that the effect was not replicated across studies. 

A replicability claim is not merely a vague description. A precise computation of the
extent of replicability is possible. An objective way to quantify the evidence that the
meta-analytic findings do not rely on single studies is as follows. For a meta-analysis of
N studies, the minimal replicability claim is that results have been
replicated in at least two studies. This claim can be asserted if the meta-analysis
result remains significant after dropping (leaving out) any single study. We suggest
accompanying the review with a quantity we term the r-value, which quantifies the
evidence towards replicability of the effects across studies. The r-value is the largest
of the N p-values obtained from the N leave-one-out meta-analyses. Like a p-value,
which quantifies the evidence against the null hypothesis of no effect, the r-value
quantifies the evidence against no replicability of effects. The smaller the r-value, the
greater the evidence that the conclusion about a primary outcome is not driven by a
single study.
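
The leave-one-out definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes an inverse-variance fixed effect meta-analysis, a left-sided alternative (a beneficial effect is negative, e.g. a log hazard ratio below zero), and hypothetical study estimates; the function names `left_sided_p` and `r_value` are ours.

```python
import math


def left_sided_p(effects, ses):
    """Left-sided p-value of an inverse-variance fixed effect meta-analysis."""
    weights = [1.0 / se ** 2 for se in ses]
    summary = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    z = summary * math.sqrt(sum(weights))
    # Standard normal CDF evaluated at z.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def r_value(effects, ses):
    """Largest p-value among the N leave-one-out meta-analyses."""
    n = len(effects)
    return max(
        left_sided_p(effects[:i] + effects[i + 1:], ses[:i] + ses[i + 1:])
        for i in range(n)
    )


# Hypothetical log hazard ratios: one strong study and two null studies.
driven_effects, driven_ses = [-0.5, 0.0, 0.02], [0.1, 0.2, 0.2]
# Hypothetical log hazard ratios: three studies that agree.
agree_effects, agree_ses = [-0.30, -0.25, -0.35], [0.1, 0.1, 0.1]

p_full = left_sided_p(driven_effects, driven_ses)
r_driven = r_value(driven_effects, driven_ses)
r_agree = r_value(agree_effects, agree_ses)
```

In the first hypothetical data set the full meta-analysis is significant at the 0.05 level, yet the r-value is far above 0.05 because dropping the strong study removes the effect; in the second, the r-value stays small.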

The report of the r-value is valuable for meta-analyses of narrow scope as well as of
broad scope. In Section 5.6 of the manual [Higgins et al., 2011] the scope of the review
question is addressed. If the scope is broad, then a review that produced a single
meta-analytic conclusion may be criticized for ‘mixing apples and oranges’, particularly
when good biologic or sociological evidence suggests that various formulations of
an intervention behave very differently or that various definitions of the condition
of interest are associated with markedly different effects of the intervention. The
advantage of a broad scope is that it can give a comprehensive summary of evidence.
The narrow scope is more manageable, but the evidence may be sparse, and findings
may not be generalizable to other settings or populations. If the r-value is large
(say above 0.05) for a meta-analysis with a narrow scope, this is worrisome since
the scope has already been selected, and the large r-value indicates that an even
stricter selection that removes one single additional study can change the significant
conclusion. If the r-value is large for a meta-analysis with a broad scope, this is
worrisome since the reason for the significant finding may be the single “orange”
among the several (null) “apples”.

We examined the extent of the replicability problem in systematic reviews. We found
that there may be a lack of replicability in a large proportion of reviews. In Section 2
we show that out of the 21 reviews with a significant meta-analysis result on the most
important outcomes of interest published on breast cancer, 13 reviews were sensitive
to leaving one study out of the meta-analysis. The problem was less pronounced
in the reviews published on influenza, where 2 reviews were sensitive to leaving one
study out of the meta-analysis, out of 6 updated reviews with significant primary
outcomes.

[Anzures-Cabrera and Higgins, 2010] write that a useful sensitivity analysis is one
in which the meta-analysis is repeated, each time omitting one of the studies. A
plot of the results of these meta-analyses, called an ‘exclusion sensitivity plot’ by
[Bax et al., 2006], will reveal any studies that have a particularly large influence on the
results of the meta-analysis. In this work, we concur with this view, but recommend
that the most relevant single-number summary of such a sensitivity analysis
be added to the report of the main results, and to the forest plot, of the meta-analysis.
The code for the computation of the r-values and sensitivity intervals is available from 
the first author upon request. 


2 The lack of replicability in systematic reviews 


We took all the updated reviews in two domains: breast cancer and influenza. Our
eligibility criteria were as follows: (a) the review included forest plots; (b) at least
one primary outcome was reported as significant at the .05 level, which is the default
significance level used in Cochrane Reviews; (c) the meta-analysis of at least one of
the primary outcomes was based on at least three studies; and (d) there was no
reporting in the review of unreliable/biased primary outcomes or poor quality of
available evidence. We consider as primary outcomes the outcomes that were defined
as primary by the review authors, and if none were defined we selected the most
important findings from the review summaries and treated the outcomes for these
findings as primary. We limit ourselves to meta-analyses that include at least three
studies, since this is the minimum number of studies for which, even if the single
studies are not significant, the meta-analysis may still be non-sensitive (i.e., a
meta-analysis based on every subset of two studies can have a significant finding).

In Cochrane reviews, the meta-analyses are of two types: fixed effect and random
effects. Under the fixed effect model all studies in the meta-analysis are assumed to
share a common (unknown) effect $\theta$. Since all studies share the same effect, it follows
that the observed effect varies from one study to the next only because of the random
error inherent in each study. The summary effect is the estimate of this common effect
$\theta$. Under the random effects model the effects in the studies, $\theta_i$, $i = 1, 2, \ldots, N$, are
assumed to have been sampled from a distribution with mean $\theta$. Therefore, there are
two sources of variance: the within-study error in estimating the effect in each study
and the variance in the true effects across studies. The summary effect is the estimate
of the effects distribution mean $\theta$. For details on estimation of these effects and their
confidence intervals, see [Higgins et al., 2011]. In this Section our results are based on
the computations of the meta-analysis p-values as suggested in [Higgins et al., 2011],
for both fixed and random effects meta-analyses.
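
To make the two models concrete, the following Python sketch computes an inverse-variance fixed effect summary and a random effects summary. It is an illustration under assumptions: the between-study variance is estimated with the DerSimonian-Laird moment estimator, which we take to be the standard choice in this setting, and the function names and input numbers are hypothetical.

```python
import math


def fixed_effect(effects, ses):
    """Inverse-variance estimate of the common effect and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    estimate = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return estimate, math.sqrt(1.0 / sum(weights))


def random_effects(effects, ses):
    """Random effects summary with a DerSimonian-Laird estimate of tau^2."""
    weights = [1.0 / se ** 2 for se in ses]
    fixed_estimate, _ = fixed_effect(effects, ses)
    # Cochran's Q measures the spread of study estimates around the fixed estimate.
    q = sum(w * (y - fixed_estimate) ** 2 for w, y in zip(weights, effects))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    # Each study's variance now has a within-study and a between-study part.
    star = [1.0 / (se ** 2 + tau2) for se in ses]
    estimate = sum(w * y for w, y in zip(star, effects)) / sum(star)
    return estimate, math.sqrt(1.0 / sum(star)), tau2


# Two hypothetical studies that disagree strongly.
est_f, se_f = fixed_effect([0.0, 1.0], [0.1, 0.1])
est_r, se_r, tau2 = random_effects([0.0, 1.0], [0.1, 0.1])
```

With heterogeneous studies the estimated between-study variance is positive and the random effects standard error is larger than the fixed effect one, which reflects the second source of variance described above.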

In the breast cancer domain 48 updated reviews were published by the Cochrane
Breast Cancer Group in the Cochrane library, out of which we analyzed 21 updated
reviews that met our eligibility criteria (14, 8, 4 and 1 reviews were excluded for
reasons (a), (b), (c) and (d), respectively). Out of the 21 eligible reviews, 13 reviews were
sensitive to leaving one study out in at least one primary outcome. Moreover, in 8
out of the 13 reviews all the significant primary outcomes were sensitive. The prevalence
of sensitive meta-analyses was similar among the fixed effect and random effects
meta-analyses, see Table 1. Among the 15 reviews with fixed effect meta-analyses, 6 reviews
were sensitive in all their primary outcomes, 2 reviews were sensitive in 66% of the primary
outcomes, 1 review was sensitive in 50% of the primary outcomes, and 6 reviews
were not sensitive in any of their primary outcomes. Among the 7 reviews with random
effects meta-analyses, 3 reviews were sensitive in all their primary outcomes, 2 reviews were
sensitive in 50% of their primary outcomes, and 2 reviews were not sensitive in any
of their primary outcomes.

In the influenza domain 25 reviews were published by different groups (e.g., Cochrane
Acute Respiratory Infections Group, Cochrane Childhood Cancer Group, etc.) in the
Cochrane library, out of which we analyzed 6 updated reviews that met our eligibility
criteria (9, 2, 7 and 1 reviews were excluded for reasons (a), (b), (c) and (d), respectively).
Our results are summarized in Table 2. Out of the 6 eligible reviews, 2 reviews were
sensitive to leaving one study out. Among the two reviews with fixed effect meta-analyses,
one review was sensitive in all primary outcomes and one review was not sensitive
in any primary outcome. Among the five reviews with random effects meta-analyses,
1 review was sensitive in 40% of the primary outcomes, and four reviews were not
sensitive in any of their outcomes.

Table 1: Table of results for the breast cancer domain. The review name (column
2); the type of meta-analysis (column 3); the number of significant primary outcomes
(column 4); the number of outcomes with r-values at most (0.01, 0.05, 0.1) (columns
5, 6, 7); the actual r-values of the primary outcomes, arranged in increasing order
(column 8). The smaller the r-value, the stronger the evidence towards replicability.
The rows are arranged by order of increasing sensitivity; the last 8 rows are sensitive
in all primary outcomes.

                                 Number of     Number of outcomes
                                 significant   non-sensitive at level
      Review     Random/Fixed    outcomes      0.01   0.05   0.1      r-values
  1   CD004421   Fixed           2             2      2      2        (1.300e-10, 1.405e-07)
  2   CD003372   Fixed           2             2      2      2        (4.000e-14, 5.368e-05)
  3   CD002943   Fixed           2             2      2      2        (9.853e-09, 0.0012)
  4   CD006242   Random          2             2      2      2        (2.580e-09, 0.03549)
  5   CD000563   Fixed           1             1      1      1        1.28e-11
  6   CD008941   Fixed           1             1      1      1        1.025e-04
  7   CD005001   Random          1             0      1      1        0.0335
  8   CD003370   Fixed           1             0      1      1        0.05341
  9   CD003367   Fixed           2             1      1      1        (7.440e-07, 0.18)
 10   CD005211   Random          4             1      2      2        (0.0017, 0.0167, 0.1231, 0.178)
 11   CD003474   Random          2             1      1      1        (0.0463, 0.253)
 12   CD003366   Fixed           3             1      1      1        (3.200e-05, 0.21, 0.38)
 13   CD003139   Fixed           3             1      1      1        (0.1053, 0.1852, 0.002)
 14   CD006823   Random          1             0      0      1        0.08
 15   CD004253   Random          1             0      0      0        0.1028
 16   CD005002   Fixed           1             0      0      0        0.15
 17   CD008792   Fixed           1             0      0      0        0.24
 18   CD007077   Fixed           1             0      0      0        0.3
 19   CD002747   Fixed           1             0      0      0        0.9641
 20   CD007913   (Random,Fixed)  2             0      0      0        (0.0712, 0.0756)
 21   CD003142   Fixed           2             0      0      0        (0.1243, 0.1827)

The influenza domain has a much smaller number of reviews with significant primary 
results than the breast cancer domain. In the influenza domain, most of the reviews 
have non-significant endpoints or low quality of evidence. 


3 Calculating and reporting the r-value: examples 


In this section we shall give examples of sensitive and non-sensitive (fixed and random
effects) meta-analyses in the breast cancer domain. For examples in the influenza
domain, see Appendix A. For each example, we shall compute the r-value, which is
based on the N leave-one-out meta-analysis p-values, as well as the sensitivity interval,
which is the union of the N leave-one-out meta-analysis confidence intervals. The detailed
computations are given in Appendix B. We shall show how to incorporate these new
quantities in the Cochrane reviews’ abstract and forest plots.
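
The sensitivity interval can be sketched as follows. This is a minimal illustration rather than the computation of Appendix B: it assumes a fixed effect meta-analysis on the log scale, two-sided 95% intervals with the normal quantile 1.96, and it reports the union of the leave-one-out intervals as the smallest interval that covers all of them. The data and function names are hypothetical.

```python
import math


def fixed_effect_ci(effects, ses, z=1.96):
    """Two-sided confidence interval of the inverse-variance fixed effect summary."""
    weights = [1.0 / se ** 2 for se in ses]
    estimate = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    half_width = z / math.sqrt(sum(weights))
    return estimate - half_width, estimate + half_width


def sensitivity_interval(effects, ses, z=1.96):
    """Smallest interval covering all N leave-one-out confidence intervals."""
    intervals = [
        fixed_effect_ci(effects[:i] + effects[i + 1:], ses[:i] + ses[i + 1:], z)
        for i in range(len(effects))
    ]
    return min(lo for lo, _ in intervals), max(hi for _, hi in intervals)


# Hypothetical log hazard ratios: one strong study and two null studies.
log_hr = [-0.5, 0.0, 0.02]
se = [0.1, 0.2, 0.2]
low, high = sensitivity_interval(log_hr, se)
# Back-transform to the hazard ratio scale used in the forest plots.
hr_interval = (math.exp(low), math.exp(high))
```

Here the sensitivity interval on the hazard ratio scale extends above 1, which signals that the conclusion is sensitive to leaving one study out.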

Our first example is based on a meta-analysis in review CD006242, analyzed by 
the authors as a random effect meta-analysis, which is non-sensitive and thus has 
a small r-value. The objective of review CD006242 was to assess the efficacy of 
therapy with Trastuzumab in women with HER2-positive metastatic breast cancer. 

Table 2: Table of results for the influenza domain. The review name (column 2); the
type of meta-analysis (column 3); the number of significant primary outcomes (column
4); the number of outcomes with r-values at most (0.01, 0.05, 0.1) (columns 5, 6, 7); the
actual r-values of the primary outcomes, arranged in increasing order (column 8). The
smaller the r-value, the stronger the evidence towards replicability. The rows are
arranged by order of increasing sensitivity. The value 0.0001* indicates that the
r-value was smaller than 0.0001.

                                 Number of     Number of outcomes
                                 significant   non-sensitive at level
      Review     Random/Fixed    outcomes      0.01   0.05   0.1      r-values
  1   CD001269   (Fixed,Random)  4             4      4      4        (0.0001*, 0.0001*, 0.0014, 0.0188)
  2   CD001169   Random          4             4      4      4        (0.0001*, 0.0014, 0.0016, 0.0025)
  3   CD004879   Random          4             4      4      4        (0.0001*, 0.0001*, 0.0007, 0.006)
  4   CD002744   Random          1             1      1      1        0.0009
  5   CD008965   Random          5             1      3      3        (0.0001*, 0.0108, 0.0471, 0.118, 0.1206)
  6   CD005050   Fixed           1             0      0      0        0.9888


Only one of the studies was (barely) significant, and the remaining four studies had
non-significant effects at the .05 significance level. However, when combined in a
meta-analysis the evidence was highly significant, and the review conclusion was
that Trastuzumab improved overall survival in HER2-positive women with metastatic
breast cancer, see the left panel of Figure 1. This is a nice example that shows
how a meta-analysis can increase power. Even after removing the single significant
study (study number 5) there was still a significant effect in the meta-analysis at
the 0.05 level; see the right panel of Figure 1. The r-value is 0.03549 based on the
meta-analysis computations as suggested in [Higgins et al., 2011]. In a recent paper,
[IntHout et al., 2014] suggested an alternative random effects meta-analysis, which
controls the type I error rate more adequately. The r-value is 0.0366 based on the
meta-analysis computations as suggested in [IntHout et al., 2014].


We suggest accompanying the original forest plot with this r-value, see Figure 2. The
significant meta-analytic conclusion can therefore be accompanied by a statement
that the replicability claim is established at the .05 level of significance. This is
a stronger scientific claim than that of the meta-analysis, and it is supported by
the data in this example. In the main results of Review CD006242 the authors
write “The combined HRs for overall survival and progression-free survival favoured
the trastuzumab-containing regimens (HR 0.82, 95% confidence interval (CI) 0.71
to 0.94, P = 0.004; moderate-quality evidence)”. To this, we suggest adding the
following information: “This result was replicated in more than one study (r-value =
0.03549)”.


Figure 1: The forest plot in Review CD006242 (Left) and excluding study 5 (Right).
The r-value was 0.03549. The sensitivity interval, [0.685, 0.987], is the confidence
interval excluding study 5 (the black diamond in the right panel) with an additional
(very small) left tail. The axis is on the logarithmic scale.

Our second example, also from the breast cancer domain, is based on a meta-analysis
in review CD008792, which was analyzed by the authors as a fixed effect meta-analysis.
In this example the fixed effect meta-analysis was sensitive. The objective of Review
CD008792 was to assess the effect of combination chemotherapy compared to the
same drugs given sequentially in women with metastatic breast cancer. In the
meta-analysis a significant finding was discovered, see the left panel of Figure 3. However,
note that the different studies seem to have different effects. Nevertheless, the review
conclusion was that the combination arm had a higher risk of progression than the
sequential arm. After removing study number 7, there was no longer a significant
effect in the meta-analysis, see the right panel of Figure 3. The r-value was 0.24.
The replicability claim was not established at the .05 level of significance. This lack
of replicability, quantified by the r-value, cautions practitioners that the significant
meta-analysis finding may depend critically on a single study.

We suggest accompanying the original forest plot with this r-value, see Figure 4. In
the main results of Review CD008792 the authors write “The combination arm had
a higher risk of progression than the sequential arm (HR 1.16; 95% CI 1.03 to 1.31; P
= 0.01) with no significant heterogeneity”. To this, we suggest adding the following
information: “We cannot rule out the possibility that this result is based on a single
study (r-value = 0.24)”.

The right panels of Figures 1 and 3 show the meta-analysis confidence intervals that
would have been computed had we considered only this specific subset of studies.
The sensitivity interval has an additional tail in the direction favoured by
the data.

Figure 2: The forest plot in the original Review CD006242 including the r-value,
which was 0.03549. The asterisks indicate which study was excluded for the r-value
computation. (Plot title: Analysis 1.1. Comparison 1 Efficacy of trastuzumab,
Outcome 1 Overall survival - all studies.)

4 Methodological extensions 

4.1 A lower bound on the extent of replicability 

A review is less sensitive than another review if a larger fraction of studies can be excluded
without reversing the significant conclusions. We can calculate the meta-analysis
significance not only after dropping each single study, but also after dropping all
pairs of studies, triplets of studies, etc. Each time we calculate the maximum p-value,
and we stop the first time it exceeds the significance level $\alpha$. The larger the number
of studies that can be dropped, the stronger the replicability claim.

Figure 3: The forest plot in Review CD008792 (Left) and excluding study 7 (Right).
The r-value was 0.24. The sensitivity interval, [0.94, 1.356], is the confidence interval
excluding study 7 (the black diamond in the right panel) with an additional (small)
right tail. The axis is on the logarithmic scale.

For example, the objective of Review CD004421 was to assess the efficacy of
taxane-containing chemotherapy regimens as adjuvant treatment of pre- or
postmenopausal women with early breast cancer. The review included 11 studies, out
of which only three studies were significant, and the remaining eight studies had
non-significant effects. When combined in a meta-analysis, the evidence was highly
significant, and the review conclusion was that the use of taxane-containing adjuvant
chemotherapy regimens improved the overall survival of women with early breast
cancer, see Figure 5.


In order to reverse the significant conclusion, we need to leave out 6 studies: for
u = 6, the r-value was 0.0281, but for u = 7, the r-value was 0.0628, see Table 3.
Therefore, with 95% confidence, the true number of studies with an effect (in the
same direction) is at least 6. More generally, testing in order at significance level
$\alpha$ results in a $1-\alpha$ confidence lower bound on the number of studies with an effect
in a fixed effect meta-analysis (see [Heller, 2011] for proof). Note that although we
have a lower bound on the number of studies that show an effect, we cannot point
out which studies these are. This is so since the pooling of evidence in the same
direction in several studies increases the lower bound, even though each study on its
own may be non-significant.
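
The sequential procedure of this subsection can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: it uses a fixed effect meta-analysis with left-sided p-values, enumerates all subsets exhaustively (feasible only for a small number of studies), and starts from u = 1, the full meta-analysis. The function names and data are hypothetical.

```python
import math
from itertools import combinations


def left_sided_p(effects, ses):
    """Left-sided p-value of an inverse-variance fixed effect meta-analysis."""
    weights = [1.0 / se ** 2 for se in ses]
    summary = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    z = summary * math.sqrt(sum(weights))
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def replicability_lower_bound(effects, ses, alpha=0.05):
    """Largest u such that every subset of N - u + 1 studies stays significant,
    testing u = 1, 2, ... in order and stopping at the first failure."""
    n = len(effects)
    bound = 0
    for u in range(1, n + 1):
        size = n - u + 1
        # r-value for this u: the worst (largest) p-value over all subsets.
        r_u = max(
            left_sided_p([effects[i] for i in idx], [ses[i] for i in idx])
            for idx in combinations(range(n), size)
        )
        if r_u > alpha:
            break
        bound = u
    return bound


# Hypothetical data: two studies with an effect and one null study.
bound_mixed = replicability_lower_bound([-0.5, -0.5, 0.0], [0.1, 0.1, 0.1])
# Hypothetical data: a single strong study and two null studies.
bound_single = replicability_lower_bound([-0.5, 0.0, 0.02], [0.1, 0.2, 0.2])
```

In the first hypothetical data set the lower bound is 2, although the procedure does not say which two studies carry the effect; in the second the bound is 1, so replicability cannot be claimed.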


4.2 Accounting for multiplicity 


When more than one primary endpoint is examined, the r-value needs to be smaller
in order to establish replicability. This is exactly the same logic as with p-values,
for which we need to lower the significance threshold when faced with multiplicity of
endpoints. Family-wise error rate (FWER) or false discovery rate (FDR) controlling
procedures can be applied to the individual r-values in order to account for the
multiple primary endpoints, see [Benjamini et al., 2009] for details.

Figure 4: The forest plot in the original Review CD008792 including the r-value,
which was 0.24. The asterisks indicate which study was excluded for the r-value 
computation. 


For example, in Review CD005211 four endpoints were examined, with the following 
r-values: (1) 0.1231; (2) 0.0017; (3) 0.0167; (4) 0.1776. For FWER control over 
replicability claims at the 0.05 level, the Bonferroni-adjusted r-values are the number 
of endpoints multiplied by the original r-values. Only endpoint (2) is reported as
replicated using Bonferroni at the 0.05 level, since it is the only Bonferroni-adjusted
r-value below 0.05: 4 × 0.0017 = 0.0068 < 0.05.
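
The Bonferroni adjustment in this example is simple enough to check directly. The short Python sketch below uses the four r-values quoted above for Review CD005211; the function name is ours, and capping the adjusted values at 1 is a common convention that we assume here.

```python
def bonferroni_adjust(r_values):
    """Multiply each r-value by the number of endpoints, capped at 1."""
    m = len(r_values)
    return [min(1.0, m * r) for r in r_values]


# r-values of endpoints (1) to (4) in Review CD005211, as quoted in the text.
r_values = [0.1231, 0.0017, 0.0167, 0.1776]
adjusted = bonferroni_adjust(r_values)
# 1-based labels of the endpoints reported as replicated at the 0.05 level.
replicated = [i + 1 for i, r in enumerate(adjusted) if r <= 0.05]
```

Only endpoint (2) survives, in agreement with the text; endpoint (3) has an adjusted r-value of 4 × 0.0167 = 0.0668, which is above 0.05.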

For FDR control over replicability claims, we can use the Benjamini-Hochberg (BH)
procedure ([Benjamini and Hochberg, 1995]) on the reported r-values. The BH-adjusted
r-values for a sorted list of M r-values, $r_{(1)} \le \ldots \le r_{(M)}$, are
$\min_{i \ge j} \frac{M r_{(i)}}{i}$, $j = 1, \ldots, M$.
In Review CD005211, the sorted list is (0.0017, 0.0167, 0.1231, 0.1776) and
the adjusted r-values are (0.0068, 0.0334, 0.1641, 0.1776). Therefore, endpoints (2) and
(3), the two endpoints with the smallest r-values in the sorted list, are reported as
replicated using FDR at the 0.05 level, since for both endpoints the BH-adjusted
r-values are below 0.05.
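
The BH adjustment can be reproduced with a few lines of Python. The sketch below applies the standard BH step-up adjustment to the four r-values quoted for Review CD005211 and returns the adjusted values in the original endpoint order; the function name is ours.

```python
def bh_adjust(r_values):
    """Benjamini-Hochberg adjusted values, returned in the input order."""
    m = len(r_values)
    order = sorted(range(m), key=lambda i: r_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest r-value down, keeping a running minimum of M * r_(i) / i.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, m * r_values[index] / rank)
        adjusted[index] = running_min
    return adjusted


# r-values of endpoints (1) to (4) in Review CD005211, as quoted in the text.
r_values = [0.1231, 0.0017, 0.0167, 0.1776]
adjusted = bh_adjust(r_values)
# 1-based labels of the endpoints reported as replicated at the 0.05 level.
replicated = [i + 1 for i, r in enumerate(adjusted) if r <= 0.05]
```

Sorted by size, the adjusted values are approximately 0.0068, 0.0334, 0.1641 and 0.1776, and endpoints (2) and (3) are reported as replicated, as stated above.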

Figure 5: The forest plot in the original Review CD004421 including the r-value,
which was 2.91e-06. The asterisks indicate which study was excluded for the r-value 
computation. 

5 Discussion 


In this work we suggested enhancing the meta-analyses in systematic reviews, for both
fixed effect and random effects models, with a measure that quantifies the strength of
replicability, i.e., the r-value. In the reporting, if the r-value is small we have evidence
that the conclusion is based on more than one study, i.e., that the effect was replicated
across studies. We suggest adding a cautionary note if the r-value is greater than the
significance level (say $\alpha = 0.05$), stating that the conclusion depends critically on a
single study. This does not mean that the conclusion is necessarily reversed, but the
large r-value warrants another examination of the studies in the meta-analysis, and if
the single study upon which the review relies was very well conducted the conclusion
may still be justified despite it being only a single study.

We would like to emphasize that replicability analysis is relevant for both fixed effect
and random effects meta-analysis. In both cases, the meta-analysis can be
significant even though the true effect is greater than zero in only one
study out of the N, and hence the replicability analysis is needed. Specifically, for the
random effects model, in the Appendix we show simulations where $N - 1$ studies have
effects $\theta_i$ sampled from the normal distribution with zero mean, and one study has effect
$\mu_N \in \{0, \ldots, 5\}$. When $\mu_N = 0$, the fraction of times the null is rejected at the nominal
0.05 level using a t-test with $N - 1$ degrees of freedom on the sample of N estimated
effect sizes is about 0.05, and using the meta-analysis computations suggested in
[Higgins et al., 2011] the fraction is at most 0.12. However, when $\mu_N > 0$, the fraction
of times the null is rejected at the nominal 0.05 level using a t-test with $N - 1$ degrees of
freedom on the sample of N estimated effect sizes can be as high as 0.15, and using the
meta-analysis computations suggested in [Higgins et al., 2011] the fraction can reach
almost 0.3. We conclude from these simulations that for random effects meta-analysis, it is
better to use the t-test, and that even with this non-liberal test the significant conclusion can
be entirely driven by a single study. Therefore, a replicability analysis is necessary
in order to rule out the possibility that a significant random effects meta-analysis
conclusion is driven by a single study.

Table 3: The r-value excluding $u - 1$ studies, for u = 2, ..., 7, in Review CD004421.
This exclusion is in worst-case order, i.e., in order of the study that will show the
highest lower bound.

                       Sensitivity interval
      u    r-value     lower bound            Excluded study
  1   2    2.91e-06    0.89                   BCIRG 001
  2   3    1.64e-04    0.91                   BIG 2-98
  3   4    2.29e-03    0.94                   Taxit 216
  4   5    8.32e-03    0.96                   NSABP B-28
  5   6    2.81e-02    0.99                   HeCOG
  6   7    6.28e-02    1.01                   PACS 01
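
A simulation of the kind described in the Discussion can be sketched as follows. All settings here are assumptions chosen for illustration and are not taken from the paper: N = 5 studies, between-study standard deviation 1, within-study standard error 0.5, a two-sided t-test, and 4000 repetitions. The critical value 2.776 is the two-sided 0.05 quantile of Student's t distribution with 4 degrees of freedom.

```python
import math
import random

# Two-sided 0.05 critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776


def rejection_rate(effect_last, n_rep=4000, tau=1.0, se=0.5, seed=0):
    """Fraction of simulated meta-analyses of N = 5 studies in which a t-test
    on the estimated effect sizes rejects the null of zero mean effect."""
    rng = random.Random(seed)
    n = 5
    rejections = 0
    for _ in range(n_rep):
        # N - 1 true effects are drawn around zero; the last one is fixed.
        true_effects = [rng.gauss(0.0, tau) for _ in range(n - 1)] + [effect_last]
        estimates = [rng.gauss(theta, se) for theta in true_effects]
        mean = sum(estimates) / n
        variance = sum((y - mean) ** 2 for y in estimates) / (n - 1)
        t_stat = mean / math.sqrt(variance / n)
        if abs(t_stat) > T_CRIT_DF4:
            rejections += 1
    return rejections / n_rep


rate_null = rejection_rate(0.0)
rate_single = rejection_rate(3.0)
```

Comparing `rate_null` with `rate_single` shows how much a single non-null study moves the rejection rate of the t-test under these assumed settings; the figures reported in the text come from the authors' own simulations and need not match this sketch.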


In our two domains there were typically 1-4 primary endpoints per review. We briefly
discussed ways to account for the multiplicity of primary endpoints in assessing
replicability in Section 4.2. We regard this as an extension since the emphasis, and the new
contribution, of this paper is the introduction of the r-value into the meta-analysis
conclusions.

A Examples for the influenza domain


Our first example is based on a meta-analysis in review CD001269, analyzed by
the authors as a random effects meta-analysis, which is non-sensitive and thus has
a small r-value. The objective of Review CD001269 was to assess the effects of
vaccines against influenza in healthy adults. Four studies were significant (all favoured
treatment), and the remaining twelve studies had non-significant effects at the .05
significance level (seven favoured the treatment, five favoured the control). When
combined in a meta-analysis the evidence was significant, and the review conclusion
was that the placebo arm had a higher risk of influenza-like illness than the vaccine
arm, see the left panel of Figure 6. Even after removing each significant study (in
particular the most influential: study number 9) there was still a significant effect in
the meta-analysis at the 0.05 level, see the right panel of Figure 6. The r-value is
0.0014. The significant meta-analytic conclusion can therefore be accompanied by a
statement that the replicability claim is established at the .05 level of significance.
This is a stronger scientific claim than that of the meta-analysis, and it is supported
by the data in this example.

We suggest accompanying the original forest plot with this r-value, see Figure 7. Note
that although the referred meta-analysis is significant and the replicability claim is
established at the .05 level of significance, in the main results of Review CD001269
the authors write: “The overall effectiveness of parenteral inactivated vaccine against
influenza-like illness (ILI) is limited, corresponding to a number needed to vaccinate
(NNV) of 40 (95% confidence interval (CI) 26 to 128)”. To this, we suggest adding the
following information: “This result was replicated in more than one study (r-value =
0.0014)”. The replicability claim is relevant even in the presence of a limited effect
size: at least two studies showed that there is a (possibly limited) effect of parenteral
inactivated vaccine against influenza-like illness (ILI).

Figure 6: The forest plot in Review CD001269 (Left) and excluding study 9 (Right).
The r-value was 0.0014. The sensitivity interval, [0.75,0.96], is the confidence interval 
excluding study 9 (the black diamond in the right panel) with an additional (very 
small) left tail. The axis is on the logarithmic scale. 


Figure 7: The forest plot in the original Review CD001269 including the r-value,
which was 0.0014. The asterisks indicate which study was excluded for the r-value
computation.

Our second example, also from the influenza domain, is based on a meta-analysis
in review CD008965, analyzed by the authors as a random effects meta-analysis.
In this example the random effects meta-analysis was sensitive. The objective of
Review CD008965 was to describe the potential benefits and harms of neuraminidase
inhibitors for influenza in all age groups. In the meta-analysis a significant finding
was discovered, see the left panel of Figure 8. Note that only one study was significant
and the remaining seven studies were not significant (with large confidence intervals).
After removing study number 1, there was no longer a significant effect in the
meta-analysis, see the right panel of Figure 8. The r-value was 0.1206 based on the random
effects meta-analysis computations as suggested in [Higgins et al., 2011], and 0.0661
based on the meta-analysis computations as suggested in [IntHout et al., 2014]. The
replicability claim was not established at the .05 level of significance. This lack
of replicability, quantified by the r-value, cautions practitioners that the significant
meta-analysis finding may depend critically on a single study.

We suggest accompanying the original forest plot with this r-value, see Figure 9. In
the main results of Review CD008965 the authors write “In adults treatment trials,
Oseltamivir significantly reduced self reported, investigator-mediated, unverified
pneumonia (RR 0.55, 95% CI 0.33 to 0.9)”. To this, we suggest adding the following
information: “We cannot rule out the possibility that this result is based on a single
study (r-value = 0.1206)”. Note that the conclusion was not that complications were
reduced, but this was due to a lack of diagnostic definitions. The authors’ conclusion in
this review was that “treatment trials with oseltamivir do not settle the question of
whether the complications of influenza (such as pneumonia) are reduced, because of
a lack of diagnostic definitions”.
Figure 8: The forest plot in Review CD008965 (Left) and excluding study 1 (Right). 
The r-value was 0.1206. The sensitivity interval, [0.24, 1.13], is the confidence interval 
excluding study 1 (the black diamond in the right panel) with an additional (small) 
right tail. The axis is on the logarithmic scale. 


B Sensitivity analysis computation details 

Let $p^L_{i_1,\ldots,i_k}$ and $p^R_{i_1,\ldots,i_k}$ be, respectively, the left- and 
right-sided p-values from a meta-analysis on the subset 
$(i_1,\ldots,i_k) \subset \{1,\ldots,N\}$ of the $N$ studies in the full meta-analysis, 
$k < N$. Let $\Pi(k)$ denote the set of all possible subsets of size $k$. 


B.1 The r-value computation 

For a meta-analysis based on $N$ studies, a replicability claim is a claim that the 
conclusion remains significant (e.g., rejection of the null hypothesis of no treatment 
effect) using a meta-analysis of each of the subsets of $N - u + 1$ studies, where 
$u = 2,\ldots,N$ is a parameter chosen by the investigator. Specifically, for $u = 2$, a 
replicability claim is a claim that the conclusion remains significant using a 
meta-analysis of each of the $N$ subsets of $N - 1$ studies. 

The r-value for replicability analysis, where we claim replicability if the conclusion 
remains significant using a meta-analysis of each of the subsets of $N - u + 1$ 
studies, is computed as follows. For a left-sided alternative, the r-value is 

$$r^L = \max_{(i_1,\ldots,i_{N-u+1}) \in \Pi(N-u+1)} p^L_{i_1,\ldots,i_{N-u+1}}.$$
Figure 9: The forest plot in the original Review CD008965 including the r-value, 
which was 0.1206. The asterisks indicate which study was excluded for the r-value 
computation. 


For a right-sided alternative, the r-value is 

$$r^R = \max_{(i_1,\ldots,i_{N-u+1}) \in \Pi(N-u+1)} p^R_{i_1,\ldots,i_{N-u+1}}.$$

For two-sided alternatives, the r-value is 

$$r = 2 \min(r^L, r^R).$$


B.2 Sensitivity analysis for confidence intervals 

The sensitivity interval is the union of all the meta-analysis confidence intervals using 
the subsets of $N - u + 1$ studies. The upper limit of the $(1 - \alpha)$ sensitivity 
interval is the upper limit of the $(1 - \alpha)$ confidence interval from the 
meta-analysis on $(i^L_1,\ldots,i^L_{N-u+1})$, where $(i^L_1,\ldots,i^L_{N-u+1})$ is 
the subset that achieves the maximum p-value for the left-sided r-value computation. 
Similarly, the lower limit of the $(1 - \alpha)$ sensitivity interval is the lower limit 
of the $(1 - \alpha)$ confidence interval from the meta-analysis on 
$(i^R_1,\ldots,i^R_{N-u+1})$, where $(i^R_1,\ldots,i^R_{N-u+1})$ is the subset that 
achieves the maximum p-value for the right-sided r-value computation. 

The meta-analysis is non-sensitive (at the desired value of $u$) if and only if the 
sensitivity interval does not contain the null hypothesis value. To see this, note 
that $r < \alpha$ if and only if $r^L < \alpha/2$ or $r^R < \alpha/2$. Since 
$r^L < \alpha/2$ if and only if the upper limits of all the meta-analysis $1 - \alpha$ 
confidence intervals of subsets of size $N - u + 1$ are below the null value, and 
$r^R < \alpha/2$ if and only if the lower limits of all the meta-analysis $1 - \alpha$ 
confidence intervals of subsets of size $N - u + 1$ are above the null value, the 
result follows. 
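The construction of the sensitivity interval can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pooling rule is a plain fixed-effect (inverse-variance) meta-analysis with a normal approximation, not the random effects computations used in the reviews, and the function names and the example data are hypothetical. Effects are taken on the log scale, so the null value is 0.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def pooled(effects, ses):
    """Fixed-effect (inverse-variance) estimate and its standard error.
    This pooling rule is an assumption made for the sketch."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total
    return estimate, 1.0 / sqrt(total)


def sensitivity_interval(effects, ses, u=2, alpha=0.05, null=0.0):
    """Return (lower, upper, r) using all subsets of size N - u + 1."""
    n = len(effects)
    norm = NormalDist()
    crit = norm.inv_cdf(1.0 - alpha / 2.0)
    fits = []
    for subset in combinations(range(n), n - u + 1):
        est, se = pooled([effects[i] for i in subset], [ses[i] for i in subset])
        p_left = norm.cdf((est - null) / se)  # left-sided p-value of the subset
        fits.append((p_left, est - crit * se, est + crit * se))
    # Upper limit: subset with the largest left-sided p-value.
    p_left_max, _, upper = max(fits, key=lambda f: f[0])
    # Lower limit: subset with the largest right-sided p-value (smallest left-sided).
    p_left_min, lower, _ = min(fits, key=lambda f: f[0])
    r = min(1.0, 2.0 * min(p_left_max, 1.0 - p_left_min))  # capped at 1 for display
    return lower, upper, r


# Hypothetical log risk ratios and standard errors (not taken from any review):
# one strongly negative study and three studies close to the null.
log_rr = [-0.9, 0.02, -0.05, 0.03]
se = [0.2, 0.3, 0.3, 0.3]
lower, upper, r = sensitivity_interval(log_rr, se, u=2)
```

With these made-up data the interval returned for u = 2 covers the null value 0 and the r-value is above 0.05, which is the equivalence stated in the argument above: the interval excludes the null exactly when r falls below the chosen level.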


B.3 Leave-one-out sensitivity procedure 

For a meta-analysis with $N$ studies and a significant effect size $\theta < \theta_0$, 
where $\theta_0$ is the null effect, e.g., 1 for HR (two-sided alternative): 

1) Compute the meta-analysis of each of the subsets of $N - u + 1$ studies. 

2) Choose the subset of $N - u + 1$ studies that achieves the maximum p-value for the 
left-sided r-value computation: $(i^L_1,\ldots,i^L_{N-u+1})$. 

3) Compute the two-sided r-value: 

$r = 2 \min(r^L, r^R)$. 

4) If the r-value < 0.05, replicability is established in at least $u$ studies. Otherwise, 
replicability is established in at most $u - 1$ studies (for $u = 2$, r-value > 0.05 means 
that the finding is not replicable). 

For a meta-analysis with $N$ studies and a significant effect size $\theta > \theta_0$, 
where $\theta_0$ is the null effect, e.g., 1 for HR (two-sided alternative): 

1) Compute the meta-analysis of each of the subsets of $N - u + 1$ studies. 

2) Choose the subset of $N - u + 1$ studies that achieves the maximum p-value for the 
right-sided r-value computation: $(i^R_1,\ldots,i^R_{N-u+1})$. 

3) Compute the two-sided r-value: 

$r = 2 \min(r^L, r^R)$. 

4) If the r-value < 0.05, replicability is established in at least $u$ studies. Otherwise, 
replicability is established in at most $u - 1$ studies (for $u = 2$, r-value > 0.05 means 
that the finding is not replicable). 
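The steps of the procedure can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: it assumes a fixed-effect (inverse-variance) pooled z statistic with a normal approximation in place of the random effects computations, effects on the log scale with null value 0, and hypothetical function names and data.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def pooled_z(effects, ses, null=0.0):
    """Fixed-effect (inverse-variance) z statistic; an assumed pooling rule."""
    weights = [1.0 / se ** 2 for se in ses]
    estimate = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return (estimate - null) * sqrt(sum(weights))


def r_value(effects, ses, u=2, null=0.0):
    """Two-sided r-value 2 * min(r_L, r_R) over all subsets of size N - u + 1."""
    n = len(effects)
    phi = NormalDist().cdf
    r_left = r_right = 0.0
    for subset in combinations(range(n), n - u + 1):        # step 1
        z = pooled_z([effects[i] for i in subset],
                     [ses[i] for i in subset], null)
        r_left = max(r_left, phi(z))                          # step 2, left-sided
        r_right = max(r_right, 1.0 - phi(z))                  # step 2, right-sided
    return min(1.0, 2.0 * min(r_left, r_right))               # step 3, capped at 1


def replicated_in_at_least(effects, ses, u=2, alpha=0.05, null=0.0):
    """Step 4: True if the r-value for this u is below alpha."""
    return r_value(effects, ses, u, null) < alpha


# Hypothetical log hazard ratios with equal standard errors (illustrative only).
consistent = [-0.5, -0.45, -0.55, -0.5]
one_driver = [-0.9, 0.02, -0.05, 0.03]
claim_consistent = replicated_in_at_least(consistent, [0.2] * 4, u=2)
claim_one_driver = replicated_in_at_least(one_driver, [0.2, 0.3, 0.3, 0.3], u=2)
```

In the first made-up data set every leave-one-out meta-analysis stays significant, so the claim holds for u = 2; in the second, dropping the single strongly negative study removes the effect and the claim fails. Enumerating all subsets is exponential in general, but for u = 2 there are only N of them.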


C Random effects meta analysis simulation 


Using the following simulation we demonstrate that a significant random effect 
meta-analysis is not equivalent to replicability: a random effect meta-analysis can 
be significant even though the effect is greater than zero in only one study out of $N$. 
We show that the probability of rejecting the null hypothesis with a single outlying 
study can be as high as 6 times the nominal level using the meta-analysis computations 
of [Higgins et al., 2011], and as high as 3 times the nominal level using the more 
conservative approach of [IntHout et al., 2014]. 


For $N \in \{3,5,7,9,20\}$ studies, we sampled $N - 1$ effects $\mu_1,\ldots,\mu_{N-1}$ from 
the distribution $N(0,\tau^2)$, where $\tau^2 \in \{0.01,0.04,0.09,0.25,0.49,1\}$. For the 
$N$th study, the effect was $\mu_N \in \{0, 0.05, 0.1, \ldots, 4.5, 5\}$. For each study 
$i \in \{1,\ldots,N\}$, we sampled observed effects $\hat{\theta}_i$ from the normal 
distribution with mean $\mu_i$ and standard deviation 0.01. We computed the random effect 
meta-analysis one-sided p-value using the computations suggested in 
[Higgins et al., 2011], i.e., using the z-test on the average observed effects, as well 
as using the t-test on the sample of observed effects. 
We estimated the probability of rejecting the null hypothesis of zero mean based on 
10^ iterations. 
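A minimal Python sketch of this sampling scheme is given below. It follows the description literally and is not the authors' code, so its output need not match Figures 10 and 11. The assumptions are: with equal within-study standard deviations the summary effect is the plain average, both tests use the same statistic (average divided by sample standard deviation over the square root of N) and differ only in the reference distribution, the t critical values are standard one-sided 5% table values, and the iteration count and seed are arbitrary defaults.

```python
import random
from math import sqrt
from statistics import NormalDist

# One-sided 5% critical values of the t-distribution with N - 1 degrees of
# freedom (standard table values) for the numbers of studies considered.
T_CRIT = {3: 2.920, 5: 2.132, 7: 1.943, 9: 1.860, 20: 1.729}
Z_CRIT = NormalDist().inv_cdf(0.95)


def rejection_rates(n_studies, tau2, mu_last, n_iter=20000, seed=0):
    """Estimate right-sided rejection rates (z-test, t-test) of zero mean.

    n_iter and seed are arbitrary choices for this sketch.
    """
    rng = random.Random(seed)
    tau, sigma = sqrt(tau2), 0.01
    z_hits = t_hits = 0
    for _ in range(n_iter):
        # N - 1 true effects from N(0, tau^2); the N-th true effect is fixed.
        mus = [rng.gauss(0.0, tau) for _ in range(n_studies - 1)] + [mu_last]
        observed = [rng.gauss(mu, sigma) for mu in mus]
        avg = sum(observed) / n_studies
        var = sum((x - avg) ** 2 for x in observed) / (n_studies - 1)
        stat = avg / sqrt(var / n_studies)
        z_hits += stat > Z_CRIT              # normal reference distribution
        t_hits += stat > T_CRIT[n_studies]   # t reference distribution
    return z_hits / n_iter, t_hits / n_iter


# Example: N = 3 studies, tau^2 = 0.01, a single outlying effect of 0.1.
z_rate, t_rate = rejection_rates(3, 0.01, 0.1)
```

Because the normal critical value is smaller than every t critical value in the table, the z-test rejects at least as often as the t-test in every run of this sketch.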


Figures 10 and 11 show the resulting estimated probability of rejecting the null 
hypothesis. The random effect meta-analysis is significant in more than 5% of the 
iterations for all $N$, at values of $\mu_N > 0$ that are not too large relative to the 
value of $\tau$. The larger the value of $\tau^2$, the greater the range of $\mu_N > 0$ for 
which the nominal level of significance is not maintained. 


In [Higgins et al., 2011], the normal distribution is used for the random effect 
meta-analysis p-value, instead of the t-distribution with $N - 1$ degrees of freedom, 
which in our simulation (with equal study weights) results in an exact $\alpha = 0.05$ 
level test when $\mu_N = 0$. We see that the usage of the z-test instead of the t-test 
results in a type I error rate substantially greater than 5% under the null hypothesis 
(i.e., $\mu_N = 0$) and in a higher rejection rate of the null hypothesis for $\mu_N > 0$ 
in comparison to the fraction of rejections using the t-test when there is no 
replicability. 
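As a rough check on the size of this inflation, and assuming (as stated above) that the test statistic follows a t-distribution with $N - 1$ degrees of freedom under the null hypothesis, the one-sided type I error of the z-test is the t tail probability beyond the normal critical value $z_{0.95} \approx 1.645$. For $N = 3$ the t-distribution with 2 degrees of freedom has a closed-form tail probability, which gives:

```latex
P(T_2 > c) = \frac{1}{2} - \frac{c}{2\sqrt{2 + c^2}},
\qquad
P(T_2 > 1.645) \approx \frac{1}{2} - \frac{1.645}{2 \times 2.169} \approx 0.121 .
```

This is in line with the type I error of about 12% reported for $N = 3$ in Figure 10. The gap narrows as $N$ grows because the t-distribution approaches the normal distribution.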
Figure 10: The average fraction of rejections at the 0.05 level, using the z-test detailed 
on page 74 of [Higgins et al., 2011]: (a) $\tau^2 = 0.01$; (b) $\tau^2 = 0.04$; 
(c) $\tau^2 = 0.09$; (d) $\tau^2 = 0.25$; (e) $\tau^2 = 0.49$; (f) $\tau^2 = 1$. The 
probability of the type I error is above 0.05 for all values of $N$ and $\tau$, ranging 
from about 12% for $N = 3$ and decreasing to 5.5% for $N = 20$. The maximum fraction of 
rejections is 0.26 for $N = 3$ and $\tau^2 = 0.01$, and decreases for increasing $N$ and 
$\tau$. The range of values of $\mu_N$ for which it is above 5% increases with $\tau$ and 
with $N$. Even if the z-test is acceptable for meta-analysis when the number of studies 
is large enough, we still have a problem with lack of replicability. 
Figure 11: The average fraction of rejections at the 0.05 level, using the t-test: 
(a) $\tau^2 = 0.01$; (b) $\tau^2 = 0.04$; (c) $\tau^2 = 0.09$; (d) $\tau^2 = 0.25$; 
(e) $\tau^2 = 0.49$; (f) $\tau^2 = 1$. The probability of the type I error is 0.05 for all 
values of $N$ and $\tau$. The maximum fraction of rejections is 0.15 for $N = 3$ and 
$\tau^2 = 0.01$, and decreases for increasing $N$ and $\tau$. The range of values of 
$\mu_N$ for which it is above 5% increases with $\tau$ as well as with $N$. Even though 
the t-test controls the probability of type I error for meta-analysis, we still have a 
problem with lack of replicability. 